Experimental Evaluation of Cache-Related Preemption Delay Aware Timing Analysis
In the presence of caches, preemptive scheduling may incur a significant overhead referred to as cache-related preemption delay (CRPD). CRPD is caused by preempting tasks evicting cached memory blocks of preempted tasks, which have to be reloaded when the preempted tasks resume their execution.
In this paper we experimentally evaluate state-of-the-art techniques to account for the CRPD during timing analysis. We find that purely synthetically-generated task sets may yield misleading conclusions regarding the relative precision of different CRPD analysis techniques and the impact of CRPD on schedulability in general. Based on task characterizations obtained by static worst-case execution time (WCET) analysis, we shed new light on the state of the art
Making Dynamic Memory Allocation Static to Support WCET Analysis
Current worst-case execution time (WCET) analyses do not support programs using dynamic memory allocation. This is mainly due to the unpredictable cache performance when standard memory allocators are used. We present algorithms to compute a static allocation for programs using dynamic memory allocation. Our algorithms strive to produce static allocations that lead to a minimal WCET in a subsequent WCET analysis. Preliminary experiments suggest that static allocations for hard real-time applications can be computed at reasonable computational cost
Cache-Related Preemption Delay Computation for Set-Associative Caches - Pitfalls and Solutions
In preemptive real-time systems, scheduling analyses need - in addition to the worst-case execution time - the context-switch cost. In case of preemption, the preempted and the preempting task may interfere on the cache memory. These interferences lead to additional reloads in the preempted task. The delay due to these reloads is referred to as the cache-related preemption delay (CRPD). The CRPD constitutes a large part of the context-switch cost. In this article, we focus on the computation of upper bounds on the CRPD based on the concepts of useful cache blocks (UCBs) and evicting cache blocks (ECBs). We explain how these concepts can be used to bound the CRPD in case of direct-mapped caches. Then we consider set-associative caches with LRU, FIFO, and PLRU replacement. We show potential pitfalls when using UCBs and ECBs to bound the CRPD in case of LRU and demonstrate that neither UCBs nor ECBs can be used to bound the CRPD in case of FIFO and PLRU. Finally, we sketch a new approach to circumvent these limitations by using the concept of relative competitiveness
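As a rough illustration of the direct-mapped case mentioned in this abstract, the sketch below bounds the CRPD by counting the cache sets that hold a useful block of the preempted task and are also touched by the preempting task, then multiplying by the cost of one reload. The function name, the modulo set mapping, and the `block_reload_time` parameter are assumptions made for this example, not taken from the article, and the sketch does not apply to FIFO or PLRU caches.

```python
def crpd_bound_direct_mapped(ucbs, ecbs, num_sets, block_reload_time):
    """Upper bound on the CRPD for a direct-mapped cache (illustrative sketch).

    ucbs: memory block numbers useful to the preempted task at the
          preemption point (assumed input, e.g. from a static analysis).
    ecbs: memory block numbers the preempting task may access.
    """
    # Assumed mapping: a memory block goes to set (block number mod num_sets).
    useful_sets = {block % num_sets for block in ucbs}
    evicting_sets = {block % num_sets for block in ecbs}
    # Only a set that is both useful and possibly evicted needs a reload.
    return block_reload_time * len(useful_sets & evicting_sets)


if __name__ == "__main__":
    # Blocks 10 and 11 map to sets 2 and 3 in an 8-set cache.
    print(crpd_bound_direct_mapped({0, 1, 2, 3}, {10, 11}, 8, 10))
```

With one block per set, each conflicting set costs at most one additional reload, which is why a set intersection is enough here.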
uiCA: Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures
Performance models that statically predict the steady-state throughput of basic blocks on particular microarchitectures, such as IACA, Ithemal, llvm-mca, OSACA, or CQA, can guide optimizing compilers and aid manual software optimization. However, their utility heavily depends on the accuracy of their predictions. The average error of existing models compared to measurements on the actual hardware has been shown to lie between 9% and 36%. But how good is this? To answer this question, we propose an extremely simple analytical throughput model that may serve as a baseline. Surprisingly, this model is already competitive with the state of the art, indicating that there is significant potential for improvement. To explore this potential, we develop a simulation-based throughput predictor. To this end, we propose a detailed parametric pipeline model that supports all Intel Core microarchitecture generations released between 2011 and 2021. We evaluate our predictor on an improved version of the BHive benchmark suite and show that its predictions are usually within 1% of measurement results, improving upon prior models by roughly an order of magnitude. The experimental evaluation also demonstrates that several microarchitectural details considered to be rather insignificant in previous work are in fact essential for accurate prediction. Our throughput predictor is available as open source
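To give a feel for what a very simple analytical throughput model can look like, the sketch below takes the larger of a front-end bound and a back-end bound. This is a generic bottleneck estimate written for illustration; it is not the baseline model proposed in the paper, and the function name, the `issue_width` parameter, and the per-port µop counts are all hypothetical inputs.

```python
def bottleneck_throughput(num_uops, issue_width, uops_per_port):
    """Estimate cycles per iteration of a basic block (illustrative sketch).

    num_uops:      total µops in the block (assumed known).
    issue_width:   µops the front end can issue per cycle (assumed).
    uops_per_port: mapping from port name to the number of µops
                   assigned to that port (assumed fixed assignment).
    """
    # Front-end bound: the block cannot issue faster than issue_width µops/cycle.
    front_end_bound = num_uops / issue_width
    # Back-end bound: the busiest port needs one cycle per µop it executes.
    back_end_bound = max(uops_per_port.values(), default=0)
    return max(front_end_bound, back_end_bound)


if __name__ == "__main__":
    print(bottleneck_throughput(8, 4, {"p0": 1, "p1": 3, "p5": 2}))
```

A model of this kind ignores dependencies, decoding, and many other pipeline details, which is exactly the sort of gap a simulation-based predictor is meant to close.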
Warping Cache Simulation of Polyhedral Programs
Techniques to evaluate a program’s cache performance fall into two camps: 1. Traditional trace-based cache simulators precisely account for sophisticated real-world cache models and support arbitrary workloads, but their runtime is proportional to the number of memory accesses performed by the program under analysis. 2. Relying on implicit workload characterizations such as the polyhedral model, analytical approaches often achieve problem-size-independent runtimes, but so far have been limited to idealized cache models. We introduce a hybrid approach, warping cache simulation, that aims to achieve applicability to real-world cache models and problem-size-independent runtimes. Like prior analytical approaches, we focus on programs in the polyhedral model, which allows us to reason about the sequence of memory accesses analytically. Combining this analytical reasoning with information about the cache behavior obtained from explicit cache simulation allows us to soundly fast-forward the simulation. By this process of warping, we accelerate the simulation so that its cost is often independent of the number of memory accesses
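For contrast with the warping approach, a traditional trace-based simulator can be sketched as follows. It models a set-associative LRU cache and does one lookup per access, so its runtime grows linearly with the trace length. The function name and the parameters `num_sets`, `ways`, and `line_size` are assumptions for this example, and no warping is implemented here.

```python
from collections import OrderedDict


def simulate_lru(trace, num_sets, ways, line_size):
    """Count misses of a set-associative LRU cache on an address trace."""
    # One ordered dict per cache set; the first key is the least recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in trace:  # one step per memory access
        block = address // line_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)  # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[block] = True
    return misses


if __name__ == "__main__":
    # A sequential walk over 512 bytes in 8-byte steps touches 8 lines of 64 bytes.
    print(simulate_lru(range(0, 512, 8), num_sets=4, ways=2, line_size=64))
```

Doubling the loop bounds of the simulated program doubles the work of this loop, which is the cost that fast-forwarding tries to avoid.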
Response-time analysis for fixed-priority systems with a write-back cache
This paper introduces analyses of write-back caches integrated into response-time analysis for fixed-priority preemptive and non-preemptive scheduling. For each scheduling paradigm, we derive four different approaches to computing the additional costs incurred due to write backs. We show the dominance relationships between these different approaches and note how they can be combined to form a single state-of-the-art approach in each case. The evaluation explores the relative performance of the different methods using a set of benchmarks, as well as making comparisons with no cache and a write-through cache. We also explore the effect of write buffers used to hide the latency of write-through caches. We show that depending upon the depth of the buffer used and the policies employed, such buffers can result in domino effects. Our evaluation shows that even ignoring domino effects, a substantial write buffer is needed to match the guaranteed performance of write-back caches
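The sketch below shows the classic fixed-point iteration for response times under fixed-priority preemptive scheduling, with a placeholder term for additional cache cost. The single `extra_cost` charged per higher-priority job is a simplification assumed for this example; it does not reproduce any of the four write-back approaches derived in the paper. Tasks are assumed to be sorted by decreasing priority, with deadlines equal to periods and integer parameters.

```python
def response_time(tasks, index, extra_cost=0):
    """Worst-case response time of tasks[index], or None if it misses its deadline.

    tasks:      list of (wcet, period) pairs, highest priority first (assumed).
    extra_cost: hypothetical additional cost charged per higher-priority job.
    """
    wcet, period = tasks[index]
    response = wcet
    while True:
        # Each higher-priority task releases ceil(response / period_j) jobs.
        interference = sum(
            -(-response // period_j) * (wcet_j + extra_cost)
            for wcet_j, period_j in tasks[:index]
        )
        next_response = wcet + interference
        if next_response > period:
            return None  # deadline (assumed equal to the period) is exceeded
        if next_response == response:
            return response  # fixed point reached
        response = next_response


if __name__ == "__main__":
    task_set = [(1, 4), (2, 6), (3, 12)]
    print(response_time(task_set, 2))                # without extra cost
    print(response_time(task_set, 2, extra_cost=1))  # with extra cost per job
```

In this small example the lowest-priority task is schedulable without the extra term and becomes unschedulable once one time unit is added per interfering job, which illustrates why tight bounds on such costs matter.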